1.Linear regression is a method for predicting a continuous outcome: it learns the coefficient values that weight each input variable, which aids in making accurate forecasts. It is simple to use and provides reliable predictions. Logistic regression is similar to linear regression in that it also determines a coefficient for each input variable, but it passes the weighted sum through a non-linear logistic (sigmoid) function, which makes it suited to binary classification problems. Because its output is a probability, logistic regression can also attach a degree of confidence to each prediction. Linear regression is used to solve regression problems, while logistic regression is most commonly used to solve classification problems.
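A minimal sketch of this difference (toy arrays, not the project's dataset): scikit-learn exposes both models with the same fit/predict interface, but one returns a real value and the other a class label.

```python
# Minimal sketch (toy arrays, not the project's dataset): linear
# regression outputs a real value, logistic regression a class.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_cont = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])  # continuous target
y_bin = np.array([0, 0, 0, 1, 1, 1])               # binary target

# Linear regression fits a straight line: y = b0 + b1 * x
lin = LinearRegression().fit(X, y_cont)

# Logistic regression passes the same linear combination through a
# sigmoid, yielding class probabilities for binary classification.
log = LogisticRegression().fit(X, y_bin)

print(lin.predict([[3.5]]))         # a real-valued estimate
print(log.predict([[3.5]]))         # a 0/1 class label
print(log.predict_proba([[3.5]]))   # class probabilities
```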

2.The response variable (y) is a random variable, while the predictor variable (x) is assumed to be non-random (fixed) and measured without error.

3.We make data simpler to read and use by preparing it. This procedure removes data inconsistencies and duplicates that might otherwise degrade the accuracy of a model. Data preparation also guarantees that no inaccurate or missing values remain due to human error or defects.
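A minimal pandas sketch of this step, assuming a toy frame with illustrative column names:

```python
# Toy data-preparation sketch: drop duplicate rows, then drop rows
# with missing values (column names are illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [50.0, 50.0, 61.0, np.nan],
    "platelets": [265000, 265000, 210000, 327000],
})

df = df.drop_duplicates()  # row 1 repeats row 0
df = df.dropna()           # the last row has a missing age
print(len(df))             # 2 clean rows remain
```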

4.Overfitting occurs when a model memorizes the training data without truly learning a generic function that would apply to unseen data. In other words, it develops a function tuned primarily to the training dataset, so it performs near-perfectly on the training data but fails badly on data it hasn't seen before. Underfitting occurs when the model is too simple, or there isn't enough data, for the model to learn a generic function that describes the process. As a result, the model performs poorly for most input data.
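A toy numpy illustration of both failure modes, assuming a quadratic signal: a degree-1 polynomial underfits, while a needlessly high degree chases the noise.

```python
# Fit polynomials of increasing degree to noisy quadratic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(0, 0.5, size=x.size)  # quadratic signal + noise

mse = {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x, y, degree)
    mse[degree] = float(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(degree, round(mse[degree], 3))

# Degree 1 underfits (large training error); degree 8 beats degree 2
# on training error only by fitting the noise -- exactly the kind of
# memorization that hurts on unseen data.
```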

a. Features:

  1. age
  2. anaemia
  3. creatinine_phosphokinase
  4. diabetes
  5. ejection_fraction
  6. high_blood_pressure
  7. platelets
  8. serum_creatinine
  9. serum_sodium
  10. sex
  11. smoking
  12. time
  13. DEATH_EVENT

b. Categoricals:

  1. Anaemia
  2. Diabetes
  3. High_blood_pressure
  4. Sex
  5. DEATH_EVENT
  6. Smoking

because anything binary is considered CATEGORICAL; by categorical we mean a variable that takes only a small number of distinct values

c. Continuous:

  1. Age
  2. Creatinine_phosphokinase
  3. Ejection_fraction
  4. Platelets
  5. Serum_creatinine
  6. Serum_Sodium
  7. Time

Because anything that can take on an effectively unlimited (infinite) range of values is considered continuous

In this project, we have to predict mortality caused by heart failure. We will use 5 different plots to visualize the data and see what it tells us. If we use only one model, we may not get high accuracy, so we use 2 different models and compare them to obtain a more accurate prediction of mortality caused by heart failure.

  1. No, there are no missing values, because every column has 299 non-null entries
  2. No, we do not have any NULL values
  3. No, we do not have any missing values

Unique values per column:

  1. age = 47
  2. anaemia = 2
  3. creatinine_phosphokinase = 208
  4. diabetes = 2
  5. ejection_fraction = 17
  6. high_blood_pressure = 2
  7. platelets = 176
  8. serum_creatinine = 40
  9. serum_sodium = 27
  10. sex = 2
  11. time = 148
  12. DEATH_EVENT = 2
  13. smoking = 2
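These checks can be reproduced with pandas (a toy frame stands in for the dataset; the notebook loads the real CSV):

```python
import pandas as pd

# Toy stand-in for the heart-failure frame.
df = pd.DataFrame({
    "anaemia": [0, 1, 0, 1],
    "age": [45.0, 60.0, 45.0, 75.0],
})

print(df.isnull().sum())  # missing values per column (all zero here)
print(df.nunique())       # unique values per column
```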

PLOTS

The boxplot of ejection_fraction shows that its median is about ~38, its minimum is 30, and its maximum is 80.

The boxplot of serum_creatinine shows that its median is about ~1, with a maximum of more than 8.

The boxplot of time shows that its median is about ~115, its minimum is about ~75, and its maximum is more than 250.
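The numbers read off a boxplot can also be computed directly (toy values standing in for a column such as ejection_fraction):

```python
# The median is the line inside the box; the box edges are the
# 25th/75th percentiles; the whiskers reach toward min and max.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

values = np.array([30, 33, 35, 38, 40, 45, 60, 80])

fig, ax = plt.subplots()
ax.boxplot(values)
plt.close(fig)

print(np.median(values))                # 39.0
print(np.percentile(values, [25, 75]))  # box edges (IQR)
```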

PLOT - PIE CHART

PIE CHART OF SEX VS DEATH EVENT

CONCLUSION FOR SEX VS DEATH EVENT

As we can see in the first pie chart, our data has more males than females: about two out of every three people in our data are male. In the second pie chart, we can see that males make up the majority of both the survivors and the deceased.

CONCLUSION FOR SEX VS ANAEMIA

As we can see in the first pie chart, our data has more males than females: about two out of every three people in our data are male. In the second pie chart, we can see that males make up the majority of both the anaemic and non-anaemic groups.

CONCLUSION FOR SEX VS DIABETES

As we can see in the first pie chart, our data has more males than females: about two out of every three people in our data are male. In the second pie chart, we can see that more males have diabetes than females.

CONCLUSION FOR SEX VS HIGH BLOOD PRESSURE

As we can see in the first pie chart, our data has more males than females: about two out of every three people in our data are male. In the second pie chart, we can see that more males have high blood pressure than females.

CONCLUSION FOR SEX VS SMOKING

As we can see in the first pie chart, our data has more males than females: about two out of every three people in our data are male. In the second pie chart, we can see that males and females smoke almost equally.

So we have created distribution plots for all our numerical variables to see their densities

HEATMAP PLOT

This heatmap tells us that there is no multicollinearity, because the correlation coefficient in none of these cells is higher than 0.5
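The check behind the heatmap can be made explicit: compute the correlation matrix and look at the largest off-diagonal entry (random toy data here; the notebook applies this to the actual frame and draws it, typically with seaborn):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["age", "ejection_fraction",
                           "serum_sodium", "time"])

corr = df.corr().abs()
np.fill_diagonal(corr.values, 0.0)  # ignore trivial self-correlations
print(corr.values.max())            # below 0.5 => no multicollinearity flag
# seaborn would draw the full matrix via: sns.heatmap(df.corr(), annot=True)
```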

SCATTERPLOT

Conclusion

As we can see in our scatterplot above, more patients died during the follow-up period than remained alive

  1. Yes, it is necessary to scale the data, since scaled data is easier for a model to learn from. The benefits of scaling:

    a. It speeds up the training process.

    b. It prevents the optimization from becoming stuck in a poor local minimum.

    c. It improves the shape of the error surface.

  2. I have used the MinMax scaler for this project
  3. Yes, the features are scaled
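A short sketch of the MinMax scaler on toy values: each feature is mapped to [0, 1] via x' = (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy columns (say, age and serum_sodium, on very different ranges).
X = np.array([[40.0, 130.0],
              [60.0, 137.0],
              [80.0, 145.0]])

scaled = MinMaxScaler().fit_transform(X)
print(scaled)  # every column now spans exactly [0, 1]
```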

We use pd.to_numeric(DATASET["X"], errors='coerce').notnull().all() to see whether a variable is numeric
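A runnable illustration of that check on toy columns: a fully numeric column passes, while a column with a stray string fails.

```python
import pandas as pd

good = pd.Series(["1", "2.5", "3"])
bad = pd.Series(["1", "abc", "3"])  # "abc" coerces to NaN

print(pd.to_numeric(good, errors="coerce").notnull().all())  # True
print(pd.to_numeric(bad, errors="coerce").notnull().all())   # False
```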

1. We do not need to modify any variables

  1. Difference between parametric and non-parametric algorithms: a parametric model assumes that a probability distribution with a given set of parameters can accurately model the population, while a non-parametric model makes no assumptions about the probability distribution when modeling data. For the models you are choosing, are they parametric or non-parametric? Explain. I have two models: Linear Regression and Logistic Regression. Both of my models are parametric, because both have a fixed number of parameters and are computationally faster, but they make assumptions about the data.

  2. Label encoding converts a categorical variable into integers, assigning each class a number between 0 and num_classes - 1. One-hot encoding replaces the categorical variable with num_classes indicator variables, of which one is 1 and the others are 0. When the variable is nominal and there is no order, we use one-hot encoding; when there is a natural order to the categories, we use label encoding.
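Both encodings on a toy nominal column (LabelEncoder assigns integers in sorted-class order; get_dummies builds the one-hot columns):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"])

labels = LabelEncoder().fit_transform(colors)  # blue=0, green=1, red=2
one_hot = pd.get_dummies(colors)               # one indicator column per class

print(list(labels))   # [2, 1, 0, 1]
print(one_hot.shape)  # (4, 3)
```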

First MODEL: Logistic Regression

Build Logistic Regression with Hyperparameter
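A sketch of hyperparameter tuning for logistic regression with scikit-learn's GridSearchCV (synthetic data here; the notebook fits the heart-failure features, and the C grid is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold cross-validated search over the regularization strength C.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```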

Part 4- Model

Second MODEL: Linear Regression

Linear Regression Regularization
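A sketch of regularized linear regression with Ridge (L2) and Lasso (L1) on synthetic data; the alpha values are illustrative choices, not the project's:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0,
                       random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty can zero them out

# Regularization pulls the coefficient magnitudes toward zero,
# curbing the overfitting discussed earlier.
print(np.abs(ols.coef_).sum())
print(np.abs(ridge.coef_).sum())
print(np.abs(lasso.coef_).sum())
```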

Part 5- Conclusion & Analysis

  1. Without tuning the hyperparameters, my accuracy scores keep giving different results. In this project, Logistic Regression gave an accuracy of 0.64, while my Linear Regression gave me 0.434, which is below 0.6 and is considered a bad result.
  2. Logistic Regression won this comparison with an accuracy of 0.64.
  3. I can expand my work by creating different plots or by adding different models with tuned hyperparameters